arXiv:1603.06042v1 [cs.CL] 19 Mar 2016


Globally Normalized Transition-Based Neural Networks 


Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, 
Kuzman Ganchev, Slav Petrov and Michael Collins 

Google Inc 
New York, NY 

{andor,chrisalberti,djweiss,severyn,apresta,kuzman,slav,mjcollins}@google.com


Abstract 

We introduce a globally normalized 
transition-based neural network model 
that achieves state-of-the-art part-of- 
speech tagging, dependency parsing and 
sentence compression results. Our model 
is a simple feed-forward neural network 
that operates on a task-specific transition 
system, yet achieves comparable or better 
accuracies than recurrent models. The 
key insight is based on a novel proof 
illustrating the label bias problem and 
showing that globally normalized models 
can be strictly more expressive than 
locally normalized models. 

1 Introduction 

Neural network approaches have taken the field of natural language
processing (NLP) by storm. In particular, variants of long short-term
memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have produced
impressive results on some of the classic NLP tasks such as
part-of-speech tagging (Ling et al., 2015), syntactic parsing
(Vinyals et al., 2015) and semantic role labeling (Zhou and Xu, 2015).
One might speculate that it is the recurrent nature of these models
that enables these results.

In this work we demonstrate that simple feed-forward networks without
any recurrence can achieve comparable or better accuracies than LSTMs,
as long as they are globally normalized. Our model, described in detail
in Section 2, uses a transition system (Nivre, 2006) and feature
embeddings as introduced by Chen and Manning (2014). We do not use any
recurrence, but perform beam search for maintaining multiple hypotheses
and introduce global normalization with a conditional random field
(CRF) objective (Bottou et al., 1997; Le Cun et al., 1998;
Lafferty et al., 2001) to overcome the label bias problem that locally
normalized models suffer from. Since we use beam inference, we
approximate the partition function by summing over the elements in the
beam, and use early updates (Collins and Roark, 2004;
Zhou et al., 2015). We compute gradients based on this approximate
global normalization and perform full backpropagation training of all
neural network parameters based on the CRF loss.

We revisit the label bias problem in Section 3 and provide a novel
proof that globally normalized models are strictly more expressive than
locally normalized models. Lookahead features can partially mitigate
this discrepancy, but cannot fully compensate for it, a point to which
we return later. To empirically demonstrate the effectiveness of global
normalization, we evaluate our model on part-of-speech tagging,
syntactic dependency parsing and sentence compression (Section 4). Our
model achieves state-of-the-art accuracy on all of these tasks,
matching or outperforming LSTMs while being significantly faster. In
particular for dependency parsing on the Wall Street Journal we achieve
the best-ever published unlabeled attachment score of 94.41%.

As discussed in more detail in Section 5, we also outperform previous
structured training approaches used for neural network transition-based
parsing. Our ablation experiments show that we outperform
Weiss et al. (2015) and Alberti et al. (2015) because we do global
backpropagation training of all model parameters, while they fix the
neural network parameters when training the global part of their model.
We also outperform Zhou et al. (2015) despite using a smaller beam. To
shed additional light on the label bias problem in practice, we provide
a sentence compression example where the local model completely fails.
We then demonstrate that a globally normalized parsing model without
any lookahead features is almost as accurate as our best model, while a
locally normalized model loses more than 10% absolute in accuracy
because it cannot effectively incorporate evidence as it becomes
available.

2 Model 

At its core, our model is an incremental transition-based parser
(Nivre, 2006). To apply it to different tasks we only need to adjust
the transition system and the input features.

2.1 Transition System 

Given an input $x$, most often a sentence, we define:

• A set of states $\mathcal{S}$.

• A special start state $s^\dagger \in \mathcal{S}$.

• A set of allowed decisions $\mathcal{A}(s)$ for all $s \in \mathcal{S}$.

• A transition function $t(s, d)$ returning a new state $s'$ for any
decision $d \in \mathcal{A}(s)$.
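
As an illustration of the four components above, the following is a minimal sketch of a transition system for tagging, in which each decision assigns a tag to the next word. The tag set and the names `State`, `start_state`, `allowed`, `transition` and `is_final` are hypothetical and chosen only for this example; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

TAGS = ("NOUN", "VERB", "DET")  # hypothetical tag set, for illustration only


@dataclass(frozen=True)
class State:
    """A state that stores the input and the full decision history."""
    words: Tuple[str, ...]
    decisions: Tuple[str, ...] = ()


def start_state(words):
    """The special start state: no decisions taken yet."""
    return State(tuple(words))


def allowed(state):
    """A(s): any tag may be chosen until every word has been tagged."""
    return TAGS if len(state.decisions) < len(state.words) else ()


def transition(state, decision):
    """t(s, d): return the new state reached by taking decision d in s."""
    if decision not in allowed(state):
        raise ValueError(f"decision {decision!r} is not allowed in this state")
    return State(state.words, state.decisions + (decision,))


def is_final(state):
    """A structure is complete after exactly one decision per word."""
    return len(state.decisions) == len(state.words)
```

Because the state stores the whole decision history, every state in this sketch is reached by exactly one decision sequence.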

We drop the dependence on $x$ for brevity. We will use a function
$\rho(s, d; \theta)$ to compute the score of decision $d$ in state $s$.
The vector $\theta$ contains the model parameters and we assume that
$\rho(s, d; \theta)$ is differentiable with respect to $\theta$.

Throughout this work we will use transition systems in which all
complete structures for the same input $x$ have the same number of
decisions $n(x)$ (or $n$ for brevity). In dependency parsing for
example, this is true for both the arc-standard and arc-eager
transition systems (Nivre, 2006), where for a sentence $x$ of length
$m$, the number of decisions for any complete parse is
$n(x) = 2 \times m$.¹ A complete structure is then a sequence of
decision/state pairs $(s_1, d_1) \ldots (s_n, d_n)$ such that
$s_1 = s^\dagger$, $d_i \in \mathcal{A}(s_i)$ for $i = 1 \ldots n$, and
$s_{i+1} = t(s_i, d_i)$. We use the notation $d_{1:j}$ to refer to a
decision sequence $d_1 \ldots d_j$.

We assume that there is a one-to-one mapping between decision sequences
$d_{1:j-1}$ and states $s_j$: that is, we essentially assume that a
state encodes the entire history of decisions. Thus, each state can be
reached by a unique decision sequence from $s^\dagger$.² We will use
decision sequences $d_{1:j-1}$ and states interchangeably: in a slight
abuse of notation, we define $\rho(d_{1:j-1}, d; \theta)$ to be equal
to $\rho(s, d; \theta)$ where $s$ is the state reached by decisions
$d_{1:j-1}$.

¹ Note that this is not true for the swap transition system defined in
Nivre (2009).

² It is straightforward to extend the approach to make use of dynamic
programming in the case where the same state can be reached by multiple
decision sequences.

The scoring function $\rho(s, d; \theta)$ can be defined in a number of
ways. In this work, following Chen and Manning (2014),
Weiss et al. (2015), and Zhou et al. (2015), we define it via a
feed-forward neural network as

$$\rho(s, d; \theta) = \phi(s; \theta^{(l)}) \cdot \theta^{(d)}.$$

Here $\theta^{(l)}$ are the parameters of the neural network, excluding
the parameters at the final layer. $\theta^{(d)}$ are the final layer
parameters for decision $d$. $\phi(s; \theta^{(l)})$ is the
representation for state $s$ computed by the neural network under
parameters $\theta^{(l)}$. Note that the score is linear in the
parameters $\theta^{(d)}$. We next describe how softmax-style
normalization can be performed at the local or global level.
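
To make the form of the scoring function concrete, here is a small sketch that computes one score per decision from a state's feature vector. It assumes a single ReLU hidden layer and a dense input vector; the layer sizes, the activation and the names `make_params`, `phi` and `scores` are illustrative assumptions, not the configuration described in the paper.

```python
import random


def make_params(n_features, n_hidden, n_decisions, seed=0):
    """Randomly initialised parameters for a one-hidden-layer scorer."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W": matrix(n_hidden, n_features),         # hidden weights, part of theta^(l)
        "b": [0.0] * n_hidden,                     # hidden biases, part of theta^(l)
        "theta_d": matrix(n_decisions, n_hidden),  # one final-layer row per decision
    }


def phi(features, params):
    """State representation phi(s; theta^(l)); the ReLU is an assumption."""
    return [
        max(0.0, sum(w * f for w, f in zip(row, features)) + bias)
        for row, bias in zip(params["W"], params["b"])
    ]


def scores(features, params):
    """rho(s, d; theta) = phi(s; theta^(l)) . theta^(d), for every decision d."""
    hidden = phi(features, params)
    return [sum(t * h for t, h in zip(row, hidden)) for row in params["theta_d"]]
```

Scaling the rows of `theta_d` scales the scores by the same factor, which reflects the linearity in the final-layer parameters noted in the text.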

2.2 Global vs. Local Normalization 

In the Chen and Manning (2014) style of greedy neural network parsing,
the conditional probability distribution over decisions $d_j$ given
context $d_{1:j-1}$ is defined as

$$p(d_j | d_{1:j-1}; \theta) = \frac{\exp \rho(d_{1:j-1}, d_j; \theta)}{Z_L(d_{1:j-1}; \theta)}, \tag{1}$$

where

$$Z_L(d_{1:j-1}; \theta) = \sum_{d' \in \mathcal{A}(d_{1:j-1})} \exp \rho(d_{1:j-1}, d'; \theta).$$

Each $Z_L(d_{1:j-1}; \theta)$ is a local normalization term. The
probability of a sequence of decisions $d_{1:n}$ is

$$p_L(d_{1:n}) = \prod_{j=1}^{n} p(d_j | d_{1:j-1}; \theta) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{\prod_{j=1}^{n} Z_L(d_{1:j-1}; \theta)}. \tag{2}$$

Beam search can be used to attempt to find the maximum of Eq. (2) with
respect to $d_{1:n}$.
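
The locally normalized sequence probability can be sketched as a sum of per-step log-softmax terms. In the sketch below, `score_fn` is a hypothetical callback that maps a decision prefix to a dictionary of scores over the allowed next decisions; it stands in for the neural scoring function and is an assumption of this example.

```python
import math


def log_softmax(scores):
    """Subtract ln Z_L from every score (computed stably)."""
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in scores.values()))
    return {d: v - log_z for d, v in scores.items()}


def local_log_prob(decisions, score_fn):
    """ln p_L(d_{1:n}) = sum_j [rho(d_{1:j-1}, d_j) - ln Z_L(d_{1:j-1})]."""
    total = 0.0
    for j, d in enumerate(decisions):
        prefix = tuple(decisions[:j])
        total += log_softmax(score_fn(prefix))[d]
    return total
```

Each step is normalized on its own, so every prefix distributes exactly probability one over its continuations, whatever evidence arrives later.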

In contrast, a Conditional Random Field (CRF) defines a distribution
$p_G(d_{1:n})$ as follows:

$$p_G(d_{1:n}) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{Z_G(\theta)}, \tag{3}$$

where

$$Z_G(\theta) = \sum_{d'_{1:n} \in \mathcal{D}_n} \exp \sum_{j=1}^{n} \rho(d'_{1:j-1}, d'_j; \theta)$$

and $\mathcal{D}_n$ is the set of all valid sequences of decisions of
length $n$. $Z_G(\theta)$ is a global normalization term. The inference
problem is now to find

$$\operatorname*{argmax}_{d_{1:n} \in \mathcal{D}_n} p_G(d_{1:n}) = \operatorname*{argmax}_{d_{1:n} \in \mathcal{D}_n} \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta).$$

Beam search can again be used to approximately find the argmax.

2.3 Training

Training data consists of inputs $x$ paired with gold decision
sequences $d^*_{1:n}$. We use stochastic gradient descent on the
negative log-likelihood of the data under the model. Under a locally
normalized model, the negative log-likelihood is

$$L_{\mathrm{local}}(d^*_{1:n}; \theta) = -\ln p_L(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} \rho(d^*_{1:j-1}, d^*_j; \theta) + \sum_{j=1}^{n} \ln Z_L(d^*_{1:j-1}; \theta), \tag{4}$$

whereas under a globally normalized model it is

$$L_{\mathrm{global}}(d^*_{1:n}; \theta) = -\ln p_G(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} \rho(d^*_{1:j-1}, d^*_j; \theta) + \ln Z_G(\theta). \tag{5}$$

A significant practical advantage of the locally normalized cost
Eq. (4) is that it factorizes into $n$ independent terms, each of which
can be computed exactly and minimized separately. By contrast, the
$Z_G$ term in Eq. (5) contains a sum over
$d'_{1:n} \in \mathcal{D}_n$ that is in many cases intractable.

To make learning tractable with the globally normalized model, we use
beam search and early updates (Collins and Roark, 2004;
Zhou et al., 2015). As the training sequence is being decoded, we keep
track of the location of the gold path in the beam. If the gold path
falls out of the beam at step $j$, a stochastic gradient step is taken
on the following objective:

$$L_{\mathrm{global\text{-}beam}}(d^*_{1:j}; \theta) = -\sum_{i=1}^{j} \rho(d^*_{1:i-1}, d^*_i; \theta) + \ln \sum_{d'_{1:j} \in \mathcal{B}_j} \exp \sum_{i=1}^{j} \rho(d'_{1:i-1}, d'_i; \theta). \tag{6}$$

Here the set $\mathcal{B}_j$ contains all paths in the beam at step
$j$, together with the gold path prefix $d^*_{1:j}$. It is
straightforward to derive gradients of the loss in Eq. (6) and to
back-propagate gradients to all levels of a neural network defining the
score $\rho(s, d; \theta)$. If the gold path remains in the beam
throughout decoding, a gradient step is performed using
$\mathcal{B}_n$, the beam at the end of decoding.

3 The Label Bias Problem

Intuitively, we would like the model to be able to revise an earlier
decision made during search, when later evidence becomes available that
rules out the earlier decision as incorrect. At first glance, it might
appear that a locally normalized model used in conjunction with beam
search or exact search is able to revise earlier decisions. However the
label bias problem (see Lafferty et al. (2001), Bottou (1991),
Bottou and LeCun (2005)) means that locally normalized models often
have a very weak ability to revise earlier decisions.

This section gives a more formal perspective on the label bias problem
than in previous work, through a proof that globally normalized models
are strictly more expressive than locally normalized models. The proof
makes use of an example that gives an illustration of the label bias
problem.

Global Models can be Strictly More Expressive than Local Models.
Consider a tagging problem where the task is to map an input sequence
$x_{1:n}$ to a decision sequence $d_{1:n}$. First, consider a locally
normalized model where we restrict the scoring function to access only
the first $i$ input symbols $x_{1:i}$ when scoring decision $d_i$. We
will return to this restriction soon. The scoring function $\rho$ can
be an otherwise arbitrary function of the tuple
$(d_{1:i-1}, d_i, x_{1:i})$:

$$p_L(d_{1:n} | x_{1:n}) = \prod_{i=1}^{n} p_L(d_i | d_{1:i-1}, x_{1:i}) = \frac{\exp \sum_{i=1}^{n} \rho(d_{1:i-1}, d_i, x_{1:i})}{\prod_{i=1}^{n} Z_L(d_{1:i-1}, x_{1:i})}.$$

Second, consider a globally normalized model

$$p_G(d_{1:n} | x_{1:n}) = \frac{\exp \sum_{i=1}^{n} \rho(d_{1:i-1}, d_i, x_{1:i})}{Z_G(x_{1:n})}.$$

This model again makes use of a scoring function
$\rho(d_{1:i-1}, d_i, x_{1:i})$ restricted to the first $i$ input
symbols when scoring decision $d_i$.

Define $\mathcal{P}_L$ to be the set of all possible distributions
$p_L(d_{1:n} | x_{1:n})$ under the local model obtained as the scores
$\rho$ vary. Similarly, define $\mathcal{P}_G$ to be the set of all
possible distributions $p_G(d_{1:n} | x_{1:n})$ under the global model.
Here a “distribution” is a function from a pair $(x_{1:n}, d_{1:n})$ to
a probability $p(d_{1:n} | x_{1:n})$. Our main result is the following:

Theorem 3.1 $\mathcal{P}_L$ is a strict subset of $\mathcal{P}_G$, that
is $\mathcal{P}_L \subsetneq \mathcal{P}_G$.

To prove this we will first prove that
$\mathcal{P}_L \subseteq \mathcal{P}_G$. This step is straightforward.
We then show that $\mathcal{P}_G \nsubseteq \mathcal{P}_L$; that is,
there are distributions in $\mathcal{P}_G$ that are not in
$\mathcal{P}_L$. The proof that
$\mathcal{P}_G \nsubseteq \mathcal{P}_L$ gives a clear illustration of
the label bias problem.

Proof that $\mathcal{P}_L \subseteq \mathcal{P}_G$. We need to show
that for any locally normalized distribution $p_L$, we can construct a
globally normalized model $p_G$ such that $p_G = p_L$. Consider a
locally normalized model with scores $\rho(d_{1:i-1}, d_i, x_{1:i})$.
Define a global model $p_G$ with scores

$$\rho'(d_{1:i-1}, d_i, x_{1:i}) = \log p_L(d_i | d_{1:i-1}, x_{1:i}).$$

Then it is easily verified that

$$p_G(d_{1:n} | x_{1:n}) = p_L(d_{1:n} | x_{1:n})$$

for all $x_{1:n}$, $d_{1:n}$. □
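
The construction in this proof can be checked numerically on a toy case. The sketch below uses a made-up local conditional distribution over two tags (dropping the dependence on the input for brevity), defines global scores as its log-probabilities, and normalizes by brute-force enumeration. The tag set and the probabilities in `local_conditional` are arbitrary values chosen for illustration.

```python
import itertools
import math

TAGS = ("A", "B")


def local_conditional(prefix):
    """A hypothetical locally normalized distribution p_L(d_i | d_{1:i-1})."""
    if not prefix:
        return {"A": 0.7, "B": 0.3}
    previous = prefix[-1]
    return {t: (0.8 if t == previous else 0.2) for t in TAGS}


def p_local(seq):
    """Product of the local conditionals along the sequence."""
    prob = 1.0
    for i, d in enumerate(seq):
        prob *= local_conditional(tuple(seq[:i]))[d]
    return prob


def global_score(seq):
    """Sum of rho'(d_{1:i-1}, d_i) = log p_L(d_i | d_{1:i-1})."""
    return sum(
        math.log(local_conditional(tuple(seq[:i]))[d]) for i, d in enumerate(seq)
    )


def p_global(seq, n):
    """Globally normalized probability, with Z_G computed by enumeration."""
    z = sum(math.exp(global_score(s)) for s in itertools.product(TAGS, repeat=n))
    return math.exp(global_score(seq)) / z
```

With these scores the global partition function sums the local sequence probabilities and therefore equals one, which is why the two distributions coincide.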

In proving $\mathcal{P}_G \nsubseteq \mathcal{P}_L$ we will use a
simple problem where every example seen in training or test data is one
of the following two tagged sentences:

$$x_1 x_2 x_3 = a\ b\ c, \quad d_1 d_2 d_3 = A\ B\ C$$
$$x_1 x_2 x_3 = a\ b\ e, \quad d_1 d_2 d_3 = A\ D\ E \tag{7}$$

Note that the input $x_2 = b$ is ambiguous: it can take tags B or D.
This ambiguity is resolved when the next input symbol, c or e, is
observed.

Now consider a globally normalized model, where the scores
$\rho(d_{1:i-1}, d_i, x_{1:i})$ are defined as follows. Define
$\mathcal{T}$ as the set $\{(A, B), (B, C), (A, D), (D, E)\}$ of bigram
tag transitions seen in the data. Similarly, define $\mathcal{E}$ as
the set $\{(a, A), (b, B), (c, C), (b, D), (e, E)\}$ of (word, tag)
pairs seen in the data. We define

$$\rho(d_{1:i-1}, d_i, x_{1:i}) = \alpha \times [\![(d_{i-1}, d_i) \in \mathcal{T}]\!] + \alpha \times [\![(x_i, d_i) \in \mathcal{E}]\!] \tag{8}$$

where $\alpha$ is the single scalar parameter of the model, and
$[\![\pi]\!] = 1$ if $\pi$ is true, 0 otherwise.

Proof that $\mathcal{P}_G \nsubseteq \mathcal{P}_L$. We will construct
a globally normalized model $p_G$ such that there is no locally
normalized model such that $p_L = p_G$.

Under the definition in Eq. (8), it is straightforward to show that

$$\lim_{\alpha \to \infty} p_G(A\ B\ C | a\ b\ c) = \lim_{\alpha \to \infty} p_G(A\ D\ E | a\ b\ e) = 1.$$

In contrast, under any definition for $\rho(d_{1:i-1}, d_i, x_{1:i})$,
we must have

$$p_L(A\ B\ C | a\ b\ c) + p_L(A\ D\ E | a\ b\ e) \le 1. \tag{9}$$

This follows because
$p_L(A\ B\ C | a\ b\ c) = p_L(A | a) \times p_L(B | A, a\ b) \times p_L(C | A\ B, a\ b\ c)$
and
$p_L(A\ D\ E | a\ b\ e) = p_L(A | a) \times p_L(D | A, a\ b) \times p_L(E | A\ D, a\ b\ e)$.
The inequality $p_L(B | A, a\ b) + p_L(D | A, a\ b) \le 1$ then
immediately implies Eq. (9).

It follows that for sufficiently large values of $\alpha$, we have
$p_G(A\ B\ C | a\ b\ c) + p_G(A\ D\ E | a\ b\ e) > 1$, and given
Eq. (9) it is impossible to define a locally normalized model with
$p_L(A\ B\ C | a\ b\ c) = p_G(A\ B\ C | a\ b\ c)$ and
$p_L(A\ D\ E | a\ b\ e) = p_G(A\ D\ E | a\ b\ e)$. □
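
The counterexample can be reproduced by brute force. The following sketch implements the indicator scores of the example over the five tags and normalizes globally by enumerating all tag sequences. One assumption is made that the text leaves implicit: the first position has no previous tag, so its transition indicator is taken to be zero.

```python
import itertools
import math

TAGS = "ABCDE"
T = {("A", "B"), ("B", "C"), ("A", "D"), ("D", "E")}              # tag bigrams
E = {("a", "A"), ("b", "B"), ("c", "C"), ("b", "D"), ("e", "E")}  # (word, tag)


def score(tags, words, alpha):
    """Sum over positions of alpha*[transition seen] + alpha*[emission seen]."""
    total = 0.0
    for i, (x, d) in enumerate(zip(words, tags)):
        if i > 0 and (tags[i - 1], d) in T:  # assumption: no transition at i = 0
            total += alpha
        if (x, d) in E:
            total += alpha
    return total


def p_global(tags, words, alpha):
    """Globally normalized probability, Z_G computed by enumeration."""
    z = sum(
        math.exp(score(candidate, words, alpha))
        for candidate in itertools.product(TAGS, repeat=len(words))
    )
    return math.exp(score(tags, words, alpha)) / z


if __name__ == "__main__":
    for alpha in (0.0, 1.0, 5.0):
        total = p_global("ABC", "abc", alpha) + p_global("ADE", "abe", alpha)
        print(alpha, total)
```

For moderately large `alpha` the two probabilities already sum to more than one, which no locally normalized model restricted to the first $i$ input symbols can match, since both sequences share the factor for the first two positions.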

Under the restriction that scores $\rho(d_{1:i-1}, d_i, x_{1:i})$
depend only on the first $i$ input symbols, the globally normalized
model is still able to model the data in Eq. (7), while the locally
normalized model fails (see Eq. (9)). The ambiguity at input symbol b
is naturally resolved when the next symbol (c or e) is observed, but
the locally normalized model is not able to revise its prediction.

It is easy to fix the locally normalized model for the example in
Eq. (7) by allowing scores $\rho(d_{1:i-1}, d_i, x_{1:i+1})$ that take
into account the input symbol $x_{i+1}$. Such lookahead is common in
practice, but insufficient in general. For every amount of lookahead
$k$, we can construct examples that cannot be modeled with a locally
normalized model by duplicating the middle input b in Eq. (7) $k + 1$
times. Only a local model with scores $\rho(d_{1:i-1}, d_i, x_{1:n})$
that considers the entire input can capture any distribution
$p(d_{1:n} | x_{1:n})$: in this case the decomposition
$p_L(d_{1:n} | x_{1:n}) = \prod_{i=1}^{n} p_L(d_i | d_{1:i-1}, x_{1:n})$
makes no independence assumptions.

However, increasing the amount of context used as input comes at a
cost, requiring more powerful learning algorithms, and potentially more
training data. For a detailed analysis of the trade-offs between
structural features in CRFs and more powerful local classifiers without
structural constraints, see Liang et al. (2008); in these experiments
local classifiers are unable to reach the performance of CRFs on
problems such as parsing and named entity recognition where structural
constraints are important. Note that there is nothing to preclude an
approach that makes use of both global normalization and more powerful
scoring functions $\rho(d_{1:i-1}, d_i, x_{1:n})$, obtaining the best
of both worlds. The experiments that follow make use of both.




                    En     En-Union             CoNLL ’09
Method              WSJ    News   Web    QTB    Ca     Ch     Cz     En     Ge     Ja     Sp     Avg
Linear CRF          97.17  97.60  94.58  96.04  98.81  94.45  98.90  97.50  97.14  97.90  98.79  97.17
Ling et al. (2015)  97.78  97.44  94.03  96.18  98.77  94.38  99.00  97.60  97.84  97.06  98.71  97.16
Our Local (B=1)     97.44  97.66  94.46  96.59  98.91  94.56  98.96  97.36  97.35  98.02  98.88  97.29
Our Local (B=8)     97.45  97.69  94.46  96.64  98.88  94.56  98.96  97.40  97.35  98.02  98.89  97.30
Our Global (B=8)    97.44  97.77  94.80  96.86  99.03  94.72  99.02  97.65  97.52  98.37  98.97  97.47

Table 1: Final POS tagging test set results on English WSJ and Treebank Union as well as CoNLL ’09.


4 Experiments 

To demonstrate the flexibility and modeling power of our approach, we
provide experimental results on a diverse set of structured prediction
tasks. We first direct our attention to POS tagging, then to syntactic
dependency parsing and finally to sentence compression.

While directly optimizing the global model of Eq. (5) works well, we
found that training the model in two steps achieves the same precision
much faster: we first pretrain the network using the local objective of
Eq. (4), and then perform additional training steps using the global
objective of Eq. (6). We pretrain all layers except the softmax layer
in this way. We purposefully abstain from complicated hand engineering
of input features, which might improve performance further
(Durrett and Klein, 2015).
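
As a rough sketch of the global beam objective used in the second training step, the function below computes the loss value from cumulative path scores at the step where the update is made. The argument names and the flag `gold_in_beam` are assumptions of this example; gradient computation and the beam search itself are omitted.

```python
import math


def log_sum_exp(values):
    """Numerically stable ln(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def global_beam_loss(gold_score, beam_scores, gold_in_beam):
    """-score(gold prefix) + ln sum over the beam (plus gold) of exp(score).

    gold_score:   cumulative score of the gold prefix up to the current step
    beam_scores:  cumulative scores of all paths currently in the beam
    gold_in_beam: True if the gold prefix is already one of the beam paths
    """
    candidates = list(beam_scores)
    if not gold_in_beam:
        # Early update case: the gold prefix fell out of the beam, so it is
        # added back to the normalization set.
        candidates.append(gold_score)
    return -gold_score + log_sum_exp(candidates)
```

The loss is zero only when the gold prefix carries all of the probability mass within the normalization set, and it grows as competing beam paths outscore the gold prefix.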

4.1 Part of Speech Tagging 

Part of speech (POS) tagging is a classic NLP task, where modeling the
structure of the output is important for achieving state-of-the-art
performance.

Data & Evaluation. We conducted experiments on a number of different
datasets: (1) English Wall Street Journal (WSJ) part of the Penn
Treebank (Marcus et al., 1993) with standard POS tagging splits;
(2) English “Treebank Union” multi-domain corpus containing data from
the OntoNotes corpus version 5 (Hovy et al., 2006), the English Web
Treebank (Petrov and McDonald, 2012), and the updated and corrected
Question Treebank (Judge et al., 2006) with identical setup to
Weiss et al. (2015); and (3) CoNLL ’09 multilingual shared task
(Hajič et al., 2009).

Model Configuration. Inspired by the integrated POS tagging and parsing
transition system of Bohnet and Nivre (2012), we employ a simple
transition system that uses only a Shift action and predicts the POS
tag of the current word on the buffer as it gets shifted to the stack.
We extract the following features on a window ±3 tokens centered at the
current focus token: word, cluster, character n-gram up to length 3. We
also extract the tag predicted for the previous 4 tokens. The network
in these experiments has a single hidden layer with 256 units on WSJ
and Treebank Union and 64 on CoNLL ’09.

Results. In Table 1 we compare our model to a linear CRF and to the
compositional character-to-word LSTM model of Ling et al. (2015). The
CRF is a first-order linear model with exact inference and the same
emission features as our model. It additionally has transition features
of the word, cluster and character n-gram up to length 3 on both
endpoints of the transition. The results for Ling et al. (2015) were
solicited from the authors.

Our local model already compares favorably against these methods on
average. Using beam search with a locally normalized model does not
help, but with global normalization it leads to a 7% reduction in
relative error, empirically demonstrating the effect of label bias. It
is also interesting to note that the character n-gram feature set is
very important, increasing average accuracy on the CoNLL ’09 datasets
by about 0.5% absolute. This shows that character-level modeling can
also be done with a simple feed-forward network without recurrence.

4.2 Dependency Parsing 

In dependency parsing the goal is to produce a directed tree
representing the syntactic structure of the input sentence.

                             WSJ           Union-News    Union-Web     Union-QTB
Method                       UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS
Martins et al. (2013)        92.89  90.55  93.10  91.13  88.23  85.04  94.21  91.54
Zhang and McDonald (2014)    93.22  91.02  93.32  91.48  88.65  85.59  93.37  90.69
Weiss et al. (2015)          93.99  92.05  93.91  92.25  89.29  86.44  94.17  92.06
Alberti et al. (2015)        94.23  92.36  94.10  92.55  89.55  86.85  94.74  93.04
Our Local (B=1)              93.17  91.18  93.11  91.46  88.42  85.58  92.49  90.38
Our Local (B=32)             93.58  91.66  93.65  92.03  88.96  86.17  93.22  91.17
Our Global (B=32)            94.41  92.55  94.44  92.93  90.17  87.54  95.40  93.64

Table 2: Final English dependency parsing test set results (without tri-training for any method).

                             Catalan       Chinese       Czech         English       German        Japanese      Spanish
Method                       UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS
Best Shared Task Result      -      87.86  -      79.17  -      80.38  -      89.88  -      87.48  -      92.57  -      87.64
Ballesteros et al. (2015)    90.22  86.42  80.64  76.52  79.87  73.62  90.56  88.01  88.83  86.10  93.47  92.55  90.38  86.59
Zhang and McDonald (2014)    91.41  87.91  82.87  78.57  86.62  80.59  92.69  90.01  89.88  87.38  92.82  91.87  90.82  87.34
Lei et al. (2014)            91.33  87.22  81.67  76.71  88.76  81.77  92.75  90.00  90.81  87.81  94.04  91.84  91.16  87.38
Bohnet and Nivre (2012)      92.44  89.60  82.52  78.51  88.82  83.73  92.87  90.60  91.37  89.38  93.67  92.63  92.24  89.60
Alberti et al. (2015)        92.31  89.17  83.57  79.90  88.45  83.57  92.70  90.56  90.58  88.20  93.99  93.10  92.26  89.33
Our Local (B=1)              91.24  88.21  81.29  77.29  85.78  80.63  91.44  89.29  89.12  86.95  93.71  92.85  91.01  88.14
Our Local (B=16)             91.91  88.93  82.22  78.26  86.25  81.28  92.16  90.05  89.53  87.4   93.61  92.74  91.64  88.88
Our Global (B=16)            92.67  89.83  84.72  80.85  88.94  84.56  93.22  91.23  90.91  89.15  93.65  92.84  92.62  89.95

Table 3: Final CoNLL ’09 dependency parsing test set results.

Data & Evaluation. We use the same corpora as in our POS tagging
experiments, except that we use the standard parsing splits of the WSJ.
We convert the English constituency trees to Stanford style
dependencies (De Marneffe et al., 2006) using version 3.3.0 of the
converter. For English, we use predicted POS tags (the same POS tags
are used for all models) and exclude punctuation from the evaluation,
as is standard. For the CoNLL ’09 datasets we follow standard practice
and include all punctuation in the evaluation. We follow
Alberti et al. (2015) and use our own predicted POS tags so that we can
include a k-best tag feature (see below) but use the supplied predicted
morphological features. We report unlabeled and labeled attachment
scores (UAS/LAS).
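
For reference, unlabeled and labeled attachment scores are commonly computed as the percentage of tokens with the correct head (UAS) and with both the correct head and the correct label (LAS). The sketch below follows that common definition; the input format of `(head_index, label)` pairs is an assumption of this example, and any punctuation filtering is left to the caller.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    gold, predicted: equal-length lists with one (head_index, label) pair per
    token, already filtered to the tokens that should be evaluated.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n
```

For example, with four evaluated tokens of which three have the correct head and two also have the correct label, the function returns a UAS of 75.0 and an LAS of 50.0.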

Model Configuration. Our model configuration is basically the same as
the one originally proposed by Chen and Manning (2014) and then refined
by Weiss et al. (2015). In particular, we use the arc-standard
transition system and extract the same set of features as prior work:
words, part of speech tags, and dependency arcs and labels in the
surrounding context of the state, as well as k-best tags as proposed by
Alberti et al. (2015). We use two hidden layers of 1,024 dimensions
each.

Results. Tables 2 and 3 show our final parsing results and a comparison
to the best systems from the literature. We obtain the best ever
published results on almost all datasets, including the WSJ. The
results in Table 2 are without tri-training. When we use tri-training,
our WSJ accuracy improves to 94.61/92.78 (UAS/LAS), which compares
favorably to the 94.26/92.41 reported by Weiss et al. (2015) with
tri-training. As we show in Section 5, these gains can be attributed to
the full backpropagation training that differentiates our approach from
that of Weiss et al. (2015) and Alberti et al. (2015). Our results also
significantly outperform the LSTM-based approaches of
Dyer et al. (2015) and Ballesteros et al. (2015).

4.3 Sentence Compression 

Our final structured prediction task is extractive 
sentence compression. 

Data & Evaluation. We follow Filippova et al. (2015), where a large
news collection is used to heuristically generate compression
instances. Our final corpus contains about 2.3M compression instances:
we use 2M examples for training, 130k for development and 160k for the
final test. We report per-token F1 score and per-sentence accuracy (A),
i.e. percentage of instances that fully match the golden compressions.
Following Filippova et al. (2015) we also run a human evaluation on 200
sentences where we ask the raters to score compressions for readability
(read) and informativeness (info) on a scale from 0 to 5.
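
One plausible way to compute the two automatic metrics is sketched below. It assumes that F1 is computed over kept tokens and micro-averaged over the corpus, and that accuracy counts sentences whose keep/drop sequence matches the gold compression exactly; the exact averaging used in the experiments is not specified here, so these choices are assumptions of the example.

```python
def compression_metrics(gold, predicted):
    """Return (per-token F1, per-sentence accuracy), both as fractions.

    gold, predicted: lists of sentences; each sentence is a list of booleans
    with True meaning the token is kept in the compression.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    tp = fp = fn = exact = 0
    for g, p in zip(gold, predicted):
        if len(g) != len(p):
            raise ValueError("sentence lengths differ")
        tp += sum(gi and pi for gi, pi in zip(g, p))
        fp += sum((not gi) and pi for gi, pi in zip(g, p))
        fn += sum(gi and (not pi) for gi, pi in zip(g, p))
        exact += int(list(g) == list(p))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, exact / len(gold)
```

Under these assumptions an empty predicted compression has zero recall and therefore an F1 of zero, and it only counts as correct if the gold compression is empty as well.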

Model Configuration. The transition system for sentence compression is
similar to POS tagging: we scan sentences from left-to-right and label
each token as keep or drop. We extract features from words, POS tags,
and dependency labels from a window of tokens centered on the input, as
well as features from the history of predictions. We use a single
hidden layer of size 400.

                           Generated corpus   Human eval
Method                     A       F1         read   info
Filippova et al. (2015)    35.36   82.83      4.66   4.03
Automatic                  -       -          4.31   3.77
Our Local (B=1)            30.51   78.72      4.58   4.03
Our Local (B=8)            31.19   75.69      -      -
Our Global (B=8)           35.16   81.41      4.67   4.07

Table 4: Sentence compression results on News data. Automatic refers to
application of the same automatic extraction rules used to generate the
News training corpus.

Results. Table 4 shows our sentence compression results. Our globally
normalized model again significantly outperforms the local model. Beam
search with a locally normalized model suffers from severe label bias
issues that we discuss on a concrete example in Section 5. We also
compare to the best sentence compression system from
Filippova et al. (2015), a 3-layer stacked LSTM which uses dependency
label information. The LSTM and our global model perform on par on both
the automatic evaluation as well as the human ratings, but our model is
roughly 100x faster. All compressions kept approximately 42% of the
tokens on average and all the models are significantly better than the
automatic extractions (p < 0.05).

5 Discussion 

We derived a proof for the label bias problem and the advantages of
global models. We then empirically verified this theoretical
superiority by demonstrating state-of-the-art performance on three
different tasks. Our experiments showed consistent improvements in
accuracy for globally normalized models over locally normalized models
with beam search. In this section we situate and compare our model to
previous work and provide two examples of the label bias problem in
practice.

5.1 Related Neural CRF Work 

Neural network models have been combined with conditional random fields
and globally normalized models before. Bottou et al. (1997) and
Le Cun et al. (1998) describe global training of neural network models
for structured prediction problems. Peng et al. (2009) add a non-linear
neural network layer to a linear-chain CRF and Do and Artières (2010)
apply a similar approach to more general Markov network structures.
Yao et al. (2014) and Zheng et al. (2015) introduce recurrence into the
model and Huang et al. (2015) finally combine CRFs and LSTMs. These
neural CRF models are limited to sequence labeling tasks where exact
inference is possible, while our model works well when exact inference
is intractable.

Method                                        UAS    LAS
Local (B=1)                                   92.85  90.59
Local (B=16)                                  93.32  91.09
Global (B=16) {θ^(d)}                         93.45  91.21
Global (B=16) {W_2, θ^(d)}                    94.01  91.77
Global (B=16) {W_1, W_2, θ^(d)}               94.09  91.81
Global (B=16) (full)                          94.38  92.17

Table 5: WSJ dev set scores for successively deeper levels of
backpropagation. The full parameter set corresponds to backpropagation
all the way to the embeddings. W_i: hidden layer i weights.

5.2 Related Transition-Based Parsing Work 

For early work on neural networks for transition-based parsing, see
Henderson (2003, 2004). Our work is closest to the work of
Weiss et al. (2015), Zhou et al. (2015) and Watanabe and Sumita (2015);
in these approaches global normalization is added to the local model of
Chen and Manning (2014). Empirically, Weiss et al. (2015) achieves the
best performance, even though their model keeps the parameters of the
locally normalized neural network fixed and only trains a perceptron
that uses the activations as features. Their model is therefore limited
in its ability to revise the predictions of the locally normalized
model. In Table 5 we show that full backpropagation training all the
way to the word embeddings is very important and significantly
contributes to the performance of our model. We also compared training
under the CRF objective with a perceptron-like hinge loss between the
gold and best elements of the beam. When we limited the backpropagation
depth to training only the top layer $\theta^{(d)}$, we found
negligible differences in accuracy: 93.20% and 93.28% for the CRF
objective and hinge loss respectively. However, when training with full
backpropagation the CRF accuracy is 0.2% higher and training converged
more than 4x faster.

Sentence: In Pakistan, former leader Pervez Musharraf has appeared in
court for the first time, on treason charges.

Method         Predicted compression                     p_L    p_G
Local (B=1)    [kept/dropped token marking not legible]  0.13   0.05
Local (B=8)    (empty compression)                       0.16   < 10^-4
Global (B=8)   [kept/dropped token marking not legible]  0.06   0.07

Table 6: Example sentence compressions where the label bias of the
locally normalized model leads to a breakdown during beam search. The
probability of each compression under the local (p_L) and global (p_G)
models shows that only the global model can properly represent zero
probability for the empty compression.

Zhou et al. (2015) perform full backpropagation training like us, but
even with a much larger beam, their performance is significantly lower
than ours. We also apply our model to two additional tasks, while they
experiment only with dependency parsing. Finally,
Watanabe and Sumita (2015) introduce recurrent components and
additional techniques like max-violation updates for a corresponding
constituency parsing model. In contrast, our model does not require any
recurrence or specialized training.

5.3 Label Bias in Practice 

We observed several instances of severe label bias in the sentence
compression task. Although using beam search with the local model
outperforms greedy inference on average, beam search leads the local
model to occasionally produce empty compressions (Table 6). It is
important to note that these are not search errors: the empty
compression has higher probability under $p_L$ than the prediction from
greedy inference. However, the more expressive globally normalized
model does not suffer from this limitation, and correctly gives the
empty compression almost zero probability.

We also present some empirical evidence that the label bias problem is
severe in parsing. We trained models where the scoring functions in
parsing at position $i$ in the sentence are limited to considering only
tokens $x_{1:i}$; hence unlike the full parsing model, there is no
ability to look ahead in the sentence when making a decision.³ The
result for a greedy model under this constraint is 76.96% UAS; for a
locally normalized model with beam search is 81.35%; and for a globally
normalized model is 93.60%. Thus the globally normalized model gets
very close to the performance of a model with full lookahead, while the
locally normalized model with a beam gives dramatically lower
performance. In our final experiments with full lookahead, the globally
normalized model achieves 94.01% accuracy, compared to 93.07% accuracy
for a local model with beam search. Thus adding lookahead allows the
local model to close the gap in performance to the global model;
however there is still a significant difference in accuracy, which may
in large part be due to the label bias problem.

³ This setting may be important in some applications, where for example
parse structures for sentence prefixes are required, or where the input
is received one word at a time and online processing is beneficial.

A number of authors have considered modified training procedures for
greedy models, or for locally normalized models.
Daumé III et al. (2009) introduce Searn, an algorithm that allows a
classifier making greedy decisions to become more robust to errors made
in previous decisions. Goldberg and Nivre (2013) describe improvements
to a greedy parsing approach that makes use of methods from imitation
learning (Ross et al., 2011) to augment the training set. Note that
these methods are focused on greedy models: they are unlikely to solve
the label bias problem when used in conjunction with beam search, given
that the problem is one of expressivity of the underlying model. More
recent work (Yazdani and Henderson, 2015; Vaswani and Sagae, 2016) has
augmented locally normalized models with correctness probabilities or
error states, effectively adding a step after every decision where the
probability of correctness of the resulting structure is evaluated.
This gives considerable gains over a locally normalized model, although
performance is lower than our full globally normalized approach.

6 Conclusions 

We presented a simple and yet powerful model architecture that produces
state-of-the-art results for POS tagging, dependency parsing and
sentence compression. Our model combines the flexibility of
transition-based algorithms and the modeling power of neural networks.
Our results demonstrate that feed-forward networks without recurrence
can outperform recurrent models such as LSTMs when they are trained
with global normalization. We further support our empirical findings
with a proof showing that global normalization helps the model overcome
the label bias problem from which locally normalized models suffer.